Containment of Shape Expression Schemas for RDF
We study the problem of containment for shape expression schemas (ShEx) for
RDF graphs. We identify a subclass of ShEx that has a natural graphical
representation in the form of shape graphs and whose semantics is captured by
a tractable notion of embedding of an RDF graph in a shape graph. When applied
to pairs of shape graphs, an embedding is a sufficient condition for
containment, and for a practical subclass of deterministic shape graphs, it is
also a necessary one, thus yielding a subclass with tractable containment.
While for general shape graphs a minimal counter-example, i.e., an instance
proving non-containment, might be of exponential size, we show that containment
is EXP-hard and in coNEXP. Finally, we show that containment for arbitrary ShEx
is coNEXP-hard and in coTwoNEXP^NP.
Management of Inconsistencies in Data Integration
Data integration aims at providing a unified view over data coming from various sources. One of the most challenging tasks for data integration is handling the inconsistencies that appear in the integrated data in an efficient and effective manner. In this chapter, we provide a survey on techniques introduced for handling inconsistencies in data integration, focusing on two groups. The first group contains techniques for computing consistent query answers, and includes mechanisms for the compact representation of repairs, query rewriting, and logic programs. The second group contains techniques focusing on the resolution of inconsistencies. This includes methodologies for computing similarity between atomic values as well as similarity between groups of data, collective techniques, scaling to large datasets, and dealing with uncertainty that is related to inconsistencies.
RDF Graph Alignment with Bisimulation
We investigate the problem of aligning two RDF databases, an essential
problem in understanding the evolution of ontologies. Our approaches address
three fundamental challenges: 1) the use of "blank" (null) names, 2) ontology
changes in which different names are used to identify the same entity, and 3)
small changes in the data values as well as small changes in the graph
structure of the RDF database. We propose approaches inspired by the classical
notion of graph bisimulation and extend them to capture the natural metrics of
edit distance on the data values and the graph structure. We evaluate our
methods on three evolving curated data sets. Overall, our results show that the
proposed methods perform well and are scalable.
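The classical notion of graph bisimulation mentioned in this abstract can be sketched as follows. This is a minimal illustration of textbook partition refinement on an edge-labeled graph, not the alignment method of the paper: it ignores edit distance entirely, and the function name `bisimulation_blocks`, the `initial_label` parameter, and the toy data are all hypothetical.

```python
from collections import defaultdict

def bisimulation_blocks(nodes, edges, initial_label):
    """Group nodes into forward-bisimulation classes by naive refinement.

    nodes: list of node names (every edge endpoint must appear in it).
    edges: iterable of (source, predicate, target) triples.
    initial_label: function giving the starting label of a node
                   (hypothetical helper: here, the literal value or None).
    """
    out = defaultdict(list)
    for source, predicate, target in edges:
        out[source].append((predicate, target))

    # Initial partition: nodes with the same initial label share a block.
    ids = {}
    block = {n: ids.setdefault(initial_label(n), len(ids)) for n in nodes}

    while True:
        ids = {}
        refined = {}
        for n in nodes:
            # A node's signature is its current block plus the set of
            # (predicate, block of target) pairs over its outgoing edges.
            signature = (block[n], frozenset((p, block[t]) for p, t in out[n]))
            refined[n] = ids.setdefault(signature, len(ids))
        # Refinement only ever splits blocks, so an unchanged block count
        # means the partition is stable.
        if len(ids) == len(set(block.values())):
            return refined
        block = refined

# Toy data (assumed): three blank nodes, two of which describe the same name.
nodes = ["_:a", "_:b", "_:c", "Alice", "Bob"]
edges = [("_:a", "name", "Alice"), ("_:b", "name", "Alice"), ("_:c", "name", "Bob")]
blocks = bisimulation_blocks(
    nodes, edges, lambda n: None if n.startswith("_:") else n
)
```

In this toy graph the blank nodes `_:a` and `_:b` end up in the same block because their outgoing edges are indistinguishable, while `_:c` is separated from them.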
Learning Schemas for Unordered XML
We consider unordered XML, where the relative order among siblings is
ignored, and we investigate the problem of learning schemas from examples given
by the user. We focus on the schema formalisms proposed in [10]: disjunctive
multiplicity schemas (DMS) and its restriction, disjunction-free multiplicity
schemas (MS). A learning algorithm takes as input a set of XML documents which
must satisfy the schema (i.e., positive examples) and a set of XML documents
which must not satisfy the schema (i.e., negative examples), and returns a
schema consistent with the examples. We investigate a learning framework
inspired by Gold [18], where a learning algorithm should be sound, i.e., always
return a schema consistent with the examples given by the user, and complete,
i.e., able to produce every schema with a sufficiently rich set of examples.
Additionally, the algorithm should be efficient, i.e., polynomial in the size of
the input. We prove that the DMS are learnable from positive examples only, but
they are not learnable when we also allow negative examples. Moreover, we show
that the MS are learnable in the presence of positive examples only, and also
in the presence of both positive and negative examples. Furthermore, for the
learnable cases, the proposed learning algorithms return minimal schemas
consistent with the examples.
Comment: Proceedings of the 14th International Symposium on Database
Programming Languages (DBPL 2013), August 30, 2013, Riva del Garda, Trento,
Italy
Characterizing XML Twig Queries with Examples
Typically, a (Boolean) query is a finite formula that defines a possibly infinite set of database instances that satisfy it (positive examples), and implicitly, the set of instances that do not satisfy the query (negative examples). We investigate the following natural question: for a given class of queries, is it possible to characterize every query with a finite set of positive and negative examples that no other query is consistent with? We study this question for twig queries and XML databases. We show that while twig queries are characterizable, they generally require exponential sets of examples. Consequently, we focus on a practical subclass of anchored twig queries and show that not only are they characterizable but also with polynomially-sized sets of examples. This result is obtained with the use of generalization operations on twig queries, whose application to an anchored twig query yields a properly contained and minimally different query. Our results illustrate further interesting and strong connections between the structure and the semantics of anchored twig queries that the class of arbitrary twig queries does not enjoy. Finally, we show that the class of unions of twig queries is not characterizable.
Bounded repairability for regular tree languages
We consider the problem of repairing unranked trees (e.g., XML documents) satisfying a given restriction specification R (e.g., a DTD) into unranked trees satisfying a given target specification T. Specifically, we focus on the question of whether one can get from any tree in a regular language R to some tree in another regular language T with a finite, uniformly bounded, number of edit operations (i.e., deletions and insertions of nodes). We give effective characterizations of the pairs of specifications R and T for which such a uniform bound exists, and we study the complexity of the problem under different representations of the regular tree languages (e.g., non-deterministic stepwise automata, deterministic stepwise automata, DTDs). Finally, we point out some connections with the analogous problem for regular languages of words.
Interactive Inference of Join Queries
We investigate the problem of inferring join queries from user interactions. The user is presented with a set of candidate tuples and is asked to label them as positive or negative depending on whether she would like the tuples as part of the join result or not. The goal is to quickly infer an arbitrary n-ary join predicate across two relations while keeping the number of user interactions as small as possible. We assume no prior knowledge of the integrity constraints between the involved relations. This kind of scenario occurs in several application settings, such as data integration, reverse engineering of database queries, and constraint inference. In such a scenario, either the user has only partial knowledge of the database schemas or the database instances are too big to be skimmed. We explore the search space by using a set of strategies that let us prune what we call “uninformative” tuples, and directly present to the user the informative ones, i.e., those that quickly lead to the goal query that the user has in mind. In this paper, we focus on the inference of joins with equality predicates and we show that for such joins deciding whether a tuple is uninformative can be done in polynomial time. Next, we propose several strategies for presenting tuples to the user in an order that minimizes the number of interactions. We show the efficiency and scalability of our approach through an experimental study on both benchmark and synthetic datasets. Finally, we prove that adding projection to our queries makes the problem intractable.
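The idea of an uninformative tuple for equality joins can be illustrated with a small sketch. It rests on a simplified model assumed here, not taken from the abstract: a join predicate is a set of attribute-position pairs (i, j), a candidate pair of rows (r, s) is selected when r[i] == s[j] for every pair in the predicate, and a candidate is uninformative when every predicate consistent with the labels so far gives it the same answer. The names `agreement` and `is_uninformative` are hypothetical.

```python
from itertools import product

def agreement(r, s):
    """Attribute-position pairs (i, j) on which rows r and s hold equal values."""
    return frozenset(
        (i, j) for i, j in product(range(len(r)), range(len(s))) if r[i] == s[j]
    )

def is_uninformative(candidate, positives, negatives):
    """True if all predicates consistent with the labels agree on the candidate.

    candidate, and each element of positives/negatives, is a pair (r, s) of rows.
    Assumes the labeled examples are themselves consistent.
    """
    r, s = candidate
    # The most specific consistent predicate: pairs shared by all positives
    # (all possible pairs when nothing has been labeled positive yet).
    most_specific = frozenset(product(range(len(r)), range(len(s))))
    for pr, ps in positives:
        most_specific &= agreement(pr, ps)

    t = agreement(r, s)
    # Every consistent predicate is a subset of most_specific; if that is
    # already contained in t, the candidate is selected in every case.
    if most_specific <= t:
        return True
    # Any consistent predicate selecting the candidate lies in most_specific & t;
    # if a negative example satisfies all of those pairs, no such predicate
    # can exist, so the candidate is rejected in every case.
    return any(most_specific & t <= agreement(nr, ns) for nr, ns in negatives)

# Toy data (assumed): one positive example agreeing on positions (0, 0).
positives = [((1, 2), (1, 5))]
certain = is_uninformative(((3, 4), (3, 9)), positives, [])
open_question = is_uninformative(((3, 4), (7, 4)), positives, [])
ruled_out = is_uninformative(((3, 4), (7, 4)), positives, [((5, 6), (7, 8))])
```

Each check only intersects and compares sets of attribute pairs, so in this simplified model the test runs in time polynomial in the number of attributes and labeled examples, which is in line with the tractability claim above.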
Learning Join Queries from User Examples
We investigate the problem of learning join queries from user examples. The user is presented with a set of candidate tuples and is asked to label them as positive or negative examples, depending on whether or not she would like the tuples as part of the join result. The goal is to quickly infer an arbitrary n-ary join predicate across an arbitrary number m of relations while keeping the number of user interactions as small as possible. We assume no prior knowledge of the integrity constraints across the involved relations. Inferring the join predicate across multiple relations when the referential constraints are unknown may occur in several applications, such as data integration, reverse engineering of database queries, and schema inference. In such scenarios, the number of tuples involved in the join is typically large. We introduce a set of strategies that let us inspect the search space and aggressively prune what we call uninformative tuples, and we directly present to the user the informative ones, that is, those that allow the user to quickly find the goal query she has in mind. In this article, we focus on the inference of joins with equality predicates and also allow disjunctive join predicates and projection in the queries. We precisely characterize the frontier between tractability and intractability for the following problems of interest in these settings: consistency checking, learnability, and deciding the informativeness of a tuple. Next, we propose several strategies for presenting tuples to the user in a given order that allows minimization of the number of interactions. We show the efficiency of our approach through an experimental study on both benchmark and synthetic datasets.
Simple Schemas for Unordered XML
We consider unordered XML, where the relative order among siblings is ignored, and propose two simple yet practical schema formalisms: disjunctive multiplicity schemas (DMS), and its restriction, disjunction-free multiplicity schemas (MS). We investigate their computational properties and characterize the complexity of the following static analysis problems: schema satisfiability, membership of a tree in the language of a schema, schema containment, twig query satisfiability, implication, and containment in the presence of schema. Our research indicates that the proposed formalisms retain much of the expressiveness of DTDs without an increase in computational complexity.
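The membership problem for a disjunction-free multiplicity schema can be sketched as below. The abstract does not spell out the formalism, so this illustration assumes a simplified reading: each element label has a rule assigning one of the multiplicities `1`, `?`, `+`, `*` to every permitted child label, labels absent from a rule are forbidden, and sibling order is ignored. The encoding of trees and schemas, the name `satisfies`, and the example schema are hypothetical, and the check of the root label is left out.

```python
from collections import Counter

# Allowed (minimum, maximum) occurrence counts per multiplicity symbol;
# None stands for "no upper bound".
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

def satisfies(tree, schema):
    """Check an unordered tree against a multiplicity schema.

    tree: pair (label, list of child trees).
    schema: dict mapping a label to {child_label: multiplicity symbol};
            a label without a rule is assumed to allow no children.
    """
    label, children = tree
    rule = schema.get(label, {})
    # Only the number of children per label matters, not their order.
    counts = Counter(child_label for child_label, _ in children)
    if any(child_label not in rule for child_label in counts):
        return False
    for child_label, multiplicity in rule.items():
        low, high = BOUNDS[multiplicity]
        n = counts[child_label]
        if n < low or (high is not None and n > high):
            return False
    return all(satisfies(child, schema) for child in children)

# Toy schema (assumed): a book has one title, at least one author,
# and optionally a year.
schema = {"book": {"title": "1", "author": "+", "year": "?"}}
ok = satisfies(("book", [("author", []), ("title", []), ("author", [])]), schema)
two_titles = satisfies(("book", [("title", []), ("title", []), ("author", [])]), schema)
no_author = satisfies(("book", [("title", [])]), schema)
```

Because the rule is checked against a bag of child labels, permuting the children of any node never changes the result, which is the sense in which the schema is "unordered".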
Interactive Join Query Inference with JIM
Specifying join predicates may become a cumbersome task in many situations, e.g., when the relations to be joined come from disparate data sources, when the values of the attributes carry little or no knowledge of metadata, or simply when the user is unfamiliar with querying formalisms. Such a task is recurrent in many traditional data management applications, such as data integration, constraint inference, and database denormalization, but it is also becoming pivotal in novel crowdsourcing applications. We present Jim (Join Inference Machine), a system for interactive join specification tasks, where the user infers an n-ary join predicate by selecting tuples that are part of the join result via Boolean membership queries. The user can label tuples as positive or negative, while the system identifies and grays out the uninformative tuples, i.e., those that do not add any information to the final learning goal. The tool also guides the user to reach her join inference goal with a minimal number of interactions.